Name: Group9-Jobless Jesters
Course: BUDT704
Section: 0502
Date: December 6, 2023

Job Seeker’s Data Disco🪩¶

Introduction¶

Overview:¶

In an era where job markets are rapidly evolving, understanding the dynamics of job postings on prominent platforms such as LinkedIn is crucial. Our project, “Job Seeker’s Data Disco”, focuses on this vital area, offering deep insights into the current landscape of job opportunities. We hope this will help college students get a sense of the professional journey ahead of them.

Age of Remote Work:¶

The remote work boom during the pandemic has transformed how industries operate, what they expect from their employees, and how they manage resources. Our project examines how these changes are reflected in LinkedIn job postings, providing a window into the demand for skills tailored to remote and hybrid work environments.

Analytical Approach:¶

Our analysis goes beyond counting job advertisements. It examines the nature and quality of engagement these postings receive, revealing where the real opportunities for job seekers lie across industries and companies and highlighting the most dynamic sectors.

Key Focus Areas:¶

Our project aims to uncover:

  1. Trends in remote work: How the prevalence of remote work in job postings varies across different locations and industries.

  2. Job market dynamics: Understanding the gap between job demand and supply.

  3. Popularity and engagement: Understanding which job types are gaining the most traction and engagement.

  4. Skillset Analysis: Identifying the most in-demand skills, including the technical and soft skills crucial for today’s job market.

Our research questions are as follows:¶

  • How has the prevalence of remote work in job postings changed across different locations, and are there differences between industries?

  • Which job type has the highest number of postings, and which jobs receive the most applications? Is there a gap between demand and supply, and why?

  • How does traction (views) compare to engagement (applies) for different jobs? What does that tell us about the popularity of different companies and job types?

  • Which skills are most wanted across all industries, and why?

Milestones and Progress¶

| Task | Task Lead | Due Date | Status | Reasoning/Planning |
| --- | --- | --- | --- | --- |
| Improvement over Project Proposal | Jeet | N/A | Completed | N/A |
| Performing Exploratory Data Analysis | Aarya | Oct 18 | Completed | N/A |
| Data Interpretation | Ning | Oct 23 | Completed | N/A |
| Data Integration | Prashant | Oct 25 | Completed | N/A |
| Filtering unwanted Rows and Columns | Hetvi | Oct 27 | Completed | N/A |
| Imputing required data | Prashant | Oct 28 | Completed | N/A |
| Managing NULL values | Aarya | Oct 28 | Completed | N/A |
| Data Reshaping | Ning | Nov 2 | Completed | N/A |
| Data Quality Check | Akshat | Nov 7 | Completed | N/A |
| Project Update | Hetvi | Nov 10 | Completed | N/A |
| Text Processing for Locations | Ning | Nov 12 | Completed | N/A |
| Text Processing for Skills data | Aarya | Nov 12 | Completed | N/A |
| Data Merging | Prashant | Nov 14 | Completed | N/A |
| Text Processing for Job Titles | Jeet | Nov 14 | Completed | N/A |
| Data Analysis - Question 1 | Prashant | Nov 20 | Completed | N/A |
| Data Analysis - Question 2 | Jeet | Nov 20 | Completed | N/A |
| Data Analysis - Question 3 | Aarya | Nov 20 | Completed | N/A |
| Data Analysis - Question 4 | Hetvi | Nov 20 | Completed | N/A |
| Inferences | Rohan | Nov 24 | Completed | N/A |
| Markdown | Akshat | Nov 28 | Completed | N/A |
| Project Report | Hetvi | Dec 4 | Completed | N/A |

Choice for Heavier Grading¶

Our project should be graded more heavily on Data Analysis for the following reasons:

  • We used advanced techniques to create custom visualizations to analyze our data, ensuring our findings were based on in-depth analysis rather than basic processing.
  • The analysis offers deep insights into the current landscape of job opportunities.
  • It will help the job applicants get an understanding of their professional journey ahead.
  • It will help applicants understand where the real opportunities lie and which sectors are experiencing the most dynamic growth.
  • The visualizations will help in understanding the current market trends and skills required to excel in the job market.

Libraries Required¶

We are importing the necessary Python libraries that will be required for the project. The following libraries are required:

  • Pandas
  • Numpy
  • Regular Expressions
  • Matplotlib
  • Seaborn
  • Plotly

Matplotlib, Seaborn and Plotly are required for visualization.

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
import re
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Job Posting Data Set¶

From the Kaggle site, we obtained a total of 8 CSV files.

First we load the job_postings dataset. The dataset has a total of 27 columns and 15886 rows. It describes the various job postings made on LinkedIn by various companies: the position offered, the description posted by the company, the maximum and minimum salary offered for the role, the location of the office, how many people applied for the job, whether remote work is allowed, how many people viewed the posting, the company URL where the posting was made, the experience level a candidate is recommended to have, pre-skills or qualifications (if the company requires them), whether the job is sponsored, the work type (full time, part time or contract based), and the currency in which employees will be paid, among other columns.

In [2]:
# Loading the job_postings csv file using pd.read_csv
job_posting_df = pd.read_csv('job_postings.csv')
job_posting_df.head(2) # Printing the first 2 rows of the job_posting dataset
Out[2]:
job_id company_id title description max_salary med_salary min_salary pay_period formatted_work_type location ... expiry closed_time formatted_experience_level skills_desc listed_time posting_domain sponsored work_type currency compensation_type
0 85008768 NaN Licensed Insurance Agent While many industries were hurt by the last fe... 52000.0 NaN 45760.0 YEARLY Full-time Chico, CA ... 1.710000e+12 NaN NaN NaN 1.690000e+12 NaN 1 FULL_TIME USD BASE_SALARY
1 133114754 77766802.0 Sales Manager Are you a dynamic and creative marketing profe... NaN NaN NaN NaN Full-time Santa Clarita, CA ... 1.700000e+12 NaN NaN NaN 1.690000e+12 NaN 0 FULL_TIME NaN NaN

2 rows × 27 columns

Data Cleaning¶

After performing EDA we found that many columns in the job_posting file will not be useful in the analysis. Columns such as max_salary and min_salary can be eliminated, as we are focusing on analysing the number of job postings with respect to location and industry.

Dropping of Columns in Job Postings¶

This logic removes several columns from the 'job_posting_df' dataset, streamlining it by dropping columns that are unnecessary for the analysis or unlikely to contribute significantly to it.

In [3]:
# dropping the unnecessary columns present in the dataset
job_posting_df.drop(['description', 'max_salary','med_salary','min_salary','pay_period','original_listed_time','job_posting_url','application_url','application_type','expiry','closed_time','formatted_experience_level','skills_desc','listed_time','posting_domain','work_type','currency','compensation_type'], axis=1, inplace=True)

Employee Count Dataset¶

Now we load the 'employee counts' dataset. The dataset has a total of 4 columns and 15907 rows. It gives details on each company: how many employees it has, how many followers it has on LinkedIn, and the time at which this was recorded. Each company is identified by a company id.

In [4]:
# loading the employee counts csv file using pd.read_csv
employee_counts_df = pd.read_csv('employee_counts.csv')

Handling Duplicates¶

Employee_counts had a lot of duplicated data, so we delete the duplicates to achieve consistency in the data. This also ensures that each company's data remains unique.

In [5]:
# dropping the duplicates present in the company_id column of the employee counts dataset
employee_counts_df.drop_duplicates(subset='company_id',inplace=True)

Merging Employee_counts with job_postings¶

Merging these datasets will help us understand the work environment by taking company size into account. This can give students valuable insights when they look for jobs at the companies they target.

In [6]:
# merging the employee counts dataset and the job postings dataset using pd.merge on the company id column
job_posting_merged_df = pd.merge(job_posting_df, employee_counts_df, on='company_id', how='left')
In [7]:
# dropping the unnecessary time_recorded column using .drop()
job_posting_merged_df.drop('time_recorded',axis=1,inplace=True)

Checking Null Values¶

We'll check for null values in the merged data frame and take the necessary steps to deal with them.

In [8]:
# printing the count of null values in each column
job_posting_merged_df.isnull().sum()
Out[8]:
job_id                     0
company_id               366
title                      0
formatted_work_type        0
location                   0
applies                 7186
remote_allowed         13546
views                   2763
sponsored                  0
employee_count           366
follower_count           366
dtype: int64

Handling Null Values¶

As we can see from the output above, 'applies', 'remote_allowed' and 'views' have null values.

  • Applies Column: In the applies column, we replace NaN with 0, on the assumption that no applications were made for that particular job posting through LinkedIn.
  • Remote_allowed Column: This can be treated as a dummy variable where 1 indicates that the posted job allows remote work. Null values can be filled with 0 to indicate that remote work is not allowed.
  • Views: In the views column, we replace NaN with 0, on the assumption that no user viewed that particular company's job posting on LinkedIn.
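The three separate fills in the next cells could equivalently be done in a single call. A minimal sketch on a toy frame (the values below are illustrative, not from our dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for job_posting_merged_df (illustrative values only)
toy = pd.DataFrame({'applies': [1.0, np.nan],
                    'remote_allowed': [np.nan, 1.0],
                    'views': [np.nan, 5.0]})

# Fill all three columns in one step instead of three separate .fillna() calls
cols = ['applies', 'remote_allowed', 'views']
toy[cols] = toy[cols].fillna(0)
print(toy.isnull().sum().sum())  # 0 — no NaNs remain
```

We keep the three separate calls below for readability, one per assumption documented above.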
In [9]:
# filling the rows having NaN with 0 in the applies column using .fillna()
job_posting_merged_df['applies'].fillna(0,inplace=True)
In [10]:
# filling the rows having NaN with 0 in the remote_allowed column using .fillna()
job_posting_merged_df['remote_allowed'].fillna(0,inplace=True)
In [11]:
# filling the rows having NaN with 0 in the views column using .fillna()
job_posting_merged_df['views'].fillna(0,inplace=True)
In [12]:
# printing the count of null values in each column after replacing the NaNs with 0s in the applies, remote_allowed and views columns
job_posting_merged_df.isnull().sum()
Out[12]:
job_id                   0
company_id             366
title                    0
formatted_work_type      0
location                 0
applies                  0
remote_allowed           0
views                    0
sponsored                0
employee_count         366
follower_count         366
dtype: int64
In [13]:
# printing the updated dataset
job_posting_merged_df.head()
Out[13]:
job_id company_id title formatted_work_type location applies remote_allowed views sponsored employee_count follower_count
0 85008768 NaN Licensed Insurance Agent Full-time Chico, CA 0.0 0.0 5.0 1 NaN NaN
1 133114754 77766802.0 Sales Manager Full-time Santa Clarita, CA 0.0 0.0 0.0 0 15.0 159.0
2 133196985 1089558.0 Model Risk Auditor Contract New York, NY 1.0 0.0 17.0 0 48.0 14476.0
3 381055942 96654609.0 Business Manager Full-time Forney, TX 0.0 0.0 0.0 0 0.0 0.0
4 529257371 1244539.0 NY Studio Assistant Full-time New York, NY 0.0 0.0 2.0 1 87.0 9790.0

Data Quality Check¶

After finishing the processing, we can see that the data is largely consistent. We identified and handled the missing values and redundancy in our data set by filtering, imputing and merging the datasets. The data cleaning has achieved the following:

  • Integrity
  • Consistency
  • Correctness of datatypes
  • Standardization
  • Reduced Redundancy

However, we will go through the data again and try to refine it and make it more consistent.
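The bullet points above can be sketched as concrete checks. A minimal example on a toy frame (illustrative values, not our actual dataset; the duplicated row mimics the redundancy we removed from employee_counts):

```python
import pandas as pd

# Toy frame with one duplicated row (illustrative values only)
qc = pd.DataFrame({'company_id': [1, 1, 2], 'views': [3, 3, 7]})

assert qc.isnull().sum().sum() == 0                # integrity: no missing values
assert pd.api.types.is_numeric_dtype(qc['views'])  # correctness of datatypes
qc = qc.drop_duplicates()                          # reduced redundancy
print(len(qc))  # 2 rows remain after dropping the duplicate
```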

Job Skills Dataset¶

Now we load the 'job skills' dataset. The dataset has a total of 2 columns and 27899 rows. It gives details of the skills that are required for a particular job posting, i.e. the different skills a person must have when applying to that company for that particular job post. It contains the job_id from the job_postings dataset and a skills column.

In [14]:
# loading the job_skills dataset using .read_csv
skills = pd.read_csv('job_skills.csv')

Now we will transform the dataframe by enumerating the skills required for each job_id and then pivoting the skill column to create a separate skill column for each job.

In [15]:
# creating a new column skill and enumerating the skills
skills['skill']= skills.groupby("job_id")["skill_abr"].cumcount() + 1
In [16]:
# labelling each enumerated value in the 'skill' column as 'Skill <n>'
skills['skill']= skills['skill'].apply(lambda x: f'Skill {x}')
skills.head()
Out[16]:
job_id skill_abr skill
0 3690843087 ACCT Skill 1
1 3690843087 FIN Skill 2
2 3691763971 MGMT Skill 1
3 3691763971 MNFC Skill 2
4 3691775263 MGMT Skill 1
In [17]:
# creating a pivot table using .pivot to transform the skills column into 3 columns representing a different skill
skill_pivot =skills.pivot(index='job_id',columns='skill',values='skill_abr')
skill_pivot.head()
Out[17]:
skill Skill 1 Skill 2 Skill 3
job_id
85008768 SALE BD NaN
133114754 SALE BD NaN
133196985 ACCT FIN NaN
529257371 DSGN ART IT
903408693 ADM NaN NaN

Merging job_posting_merged_df with skill_pivot¶

Now, we merge the two dataframes 'job_posting_merged_df' and 'skill_pivot'. Both share a common column, job_id, so we merge the two dataframes on that column using a left join. This means that all the rows from the job_posting_merged_df dataframe are included in the resulting dataframe, and only the matching rows from skill_pivot are included.

Merging these dataframes will help us understand which skills a company requires for a particular job while keeping all the details related to the job. This can give students valuable insight into which skills they need to develop in order to apply for that job.

In [18]:
# merging the two dataframes using .merge and using left join
job = pd.merge(job_posting_merged_df, skill_pivot, on='job_id', how='left', suffixes=('','skill_'))

Companies Dataset¶

Now we load the 'companies' dataset. The dataset has a total of 10 columns and 994 rows. It gives details of all the companies that have posted jobs on LinkedIn, providing the information a person needs when applying for a job.

In [19]:
# loading the companies dataset using .read_csv
company = pd.read_csv('companies.csv')
In [20]:
# checking for duplicates in the dataset
company.duplicated().unique()
Out[20]:
array([False])

Merging Job and Company¶

Now, we merge the two dataframes 'job' and 'company'. Both share a common column, company_id, so we merge the two dataframes on that column using a left join. This means that all the rows from the job dataframe are included in the resulting dataframe, and only the matching rows from the company_id column of the company dataset are included.

Merging these dataframes attaches each company's details, such as its size and location, to its job postings. This can give students valuable insight into the companies they are applying to.

In [21]:
# merging "job" and "company" data on the "company_id" with a left join. In this way we will retain all entries from "Job"
df_job = pd.merge(job, company, on='company_id', how='left')
df_job.head()
Out[21]:
job_id company_id title formatted_work_type location applies remote_allowed views sponsored employee_count ... Skill 3 name description company_size state country city zip_code address url
0 85008768 NaN Licensed Insurance Agent Full-time Chico, CA 0.0 0.0 5.0 1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 133114754 77766802.0 Sales Manager Full-time Santa Clarita, CA 0.0 0.0 0.0 0 15.0 ... NaN CargoLogin. CargoLogin is a freight and logistics company ... 1.0 California US Santa Clarita 28358 Constellation Rd https://www.linkedin.com/company/cargologinllc
2 133196985 1089558.0 Model Risk Auditor Contract New York, NY 1.0 0.0 17.0 0 48.0 ... NaN Employvision Inc. Employvision is a talent acquisition and manag... 1.0 New Jersey US Somerset 08873 285 Davidson Ave https://www.linkedin.com/company/employvision
3 381055942 96654609.0 Business Manager Full-time Forney, TX 0.0 0.0 0.0 0 0.0 ... NaN First Baptist Church Forney NaN 2.0 Texas US Forney 75126 1003 College St https://www.linkedin.com/company/fbc-forney
4 529257371 1244539.0 NY Studio Assistant Full-time New York, NY 0.0 0.0 2.0 1 87.0 ... IT Ken Fulk Inc Ken Fulk Inc, is a creative think tank based o... 1.0 California US San Francisco 94103 310 Seventh Street https://www.linkedin.com/company/ken-fulk-inc.

5 rows × 23 columns

While going through the data we found that the location column is not consistent. We identified five cases:

1] Only the city is given

2] Only the country is given

3] The city and the state are given

4] The state and country are given

5] The city, state and the country are given

We decided to keep all the datapoints in the location column in city-state format. To achieve this, we processed the location data as follows.
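The core of the normalization below is a simple slice: the last two characters of a city-state string become the state abbreviation, and the rest becomes the city. A minimal sketch on sample strings (illustrative values, not drawn from the dataset):

```python
# Illustrative location strings in the city-state format we standardize to
samples = ['Chico, CA', 'New York, NY']

parsed = []
for loc in samples:
    state = loc[-2:]                          # last two characters = state abbreviation
    city = loc[:-2].replace(',', '').strip()  # the rest, comma stripped = city
    parsed.append((city, state))
print(parsed)  # [('Chico', 'CA'), ('New York', 'NY')]
```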

In [22]:
# counting the unique values in "Location" Column
df_job['location'].value_counts()
Out[22]:
United States     1133
New York, NY       398
Chicago, IL        267
Houston, TX        243
Atlanta, GA        207
                  ... 
Visalia, CA          1
Page, AZ             1
Blacksburg, VA       1
Windsor, WI          1
Vicksburg, MS        1
Name: location, Length: 3010, dtype: int64

To make the data as consistent as possible, certain locations need to be renamed. We create a dictionary mapping each location that needs to change to the new name that will replace it.

In [23]:
# creating a new dictionary where certain locations (key) are given their new name (value)
new_location = {'United States': 'United States, US', 'New York City Metropolitan Area': 'New York City Metropolitan Area,NY', 'San Francisco Bay Area': 'San Francisco Bay Area, CA',
                'Los Angeles Metropolitan Area' : 'Los Angeles Metropolitan Area, CA', 'Washington DC-Baltimore Area': 'Washington DC-Baltimore Area, DC',
                'Dallas-Fort Worth Metroplex': 'Dallas-Fort Worth Metroplex, TX', 'Atlanta Metropolitan Area': 'Atlanta Metropolitan Area,GA',
               'Greater Chicago Area': 'Greater Chicago Area, IL', 'Greater Houston': 'Greater Houston, TX', 'San Diego Metropolitan Area':'San Diego Metropolitan Area, CA',
               'Greater Minneapolis-St. Paul Area': 'Greater Minneapolis-St. Paul Area, MN'}
# using .replace to rename the selected locations in the new_location dictionary
df_job['location'].replace(new_location,inplace=True)
In [24]:
# Filtering to keep only rows where "location" contains "," and creating a separate copy
jobs= df_job[df_job['location'].str.contains(',')==True].copy()
In [25]:
# Slice the State Abbreviations from the "Location" column
jobs['State'] =  jobs['location'].map(lambda s: s[-2:])
jobs['City'] = jobs['location'].map(lambda c: c[:-2]).str.replace(',','')
jobs.columns
Out[25]:
Index(['job_id', 'company_id', 'title', 'formatted_work_type', 'location',
       'applies', 'remote_allowed', 'views', 'sponsored', 'employee_count',
       'follower_count', 'Skill 1', 'Skill 2', 'Skill 3', 'name',
       'description', 'company_size', 'state', 'country', 'city', 'zip_code',
       'address', 'url', 'State', 'City'],
      dtype='object')
In [26]:
# Check the values in "State" Column
jobs['State'].value_counts()
Out[26]:
CA    1974
TX    1360
US    1133
NY     919
FL     757
es     717
IL     590
GA     482
NC     449
PA     441
MA     438
VA     436
WA     408
OH     397
NJ     386
MI     329
AZ     311
MD     310
MN     306
CO     287
TN     259
SC     233
WI     230
MO     210
IN     186
DC     159
CT     145
UT     141
OR     127
NV     124
ea     120
OK     111
IA     110
KS      98
LA      97
AL      95
KY      92
ID      68
AR      66
DE      56
NE      54
NM      49
MS      45
RI      40
NH      35
WV      34
HI      33
AK      31
ND      29
MT      20
SD      20
WY      18
ME      15
VT      13
Name: State, dtype: int64

After further checking the values in the "State" and "City" columns, we discovered that the "State" column contains some values that are not states, such as 'ea' and 'es', while the state information can be found in the corresponding "City" column. So we define a function called "findstate" that looks for state information in the "City" column: if a word matches one of the 50 states of the United States, we return it to the result list, otherwise we return 'NA'. Then we update the "State" column and convert the values to state abbreviations for further analysis.

In [27]:
# creating a dictionary "state_abb" mapping US states to their respective abbreviations
state_abb = {'Alaska': 'AK', 'Alabama': 'AL', 'Arkansas': 'AR', 'Arizona': 'AZ', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',
          'District of Columbia': 'DC', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Iowa': 'IA', 'Idaho': 'ID',
          'Illinois': 'IL', 'Indiana': 'IN', 'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Massachusetts': 'MA', 'Maryland': 'MD',
          'Maine': 'ME', 'Michigan': 'MI', 'Minnesota': 'MN', 'Missouri': 'MO', 'Mississippi': 'MS', 'Montana': 'MT', 'North Carolina': 'NC',
          'North Dakota': 'ND', 'Nebraska': 'NE', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'Nevada': 'NV',
          'New York': 'NY', 'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
          'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Virginia': 'VA', 'Vermont': 'VT', 'Washington': 'WA',
          'Wisconsin': 'WI', 'West Virginia': 'WV', 'Wyoming': 'WY'}
In [28]:
# creating a list "state" containing the names of all US states and the District of Columbia
state = ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
         'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
         'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri', 'Mississippi', 'Montana', 'North Carolina', 'North Dakota', 'Nebraska',
         'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
         'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia',
         'Wyoming']
In [29]:
# 'ea' and 'es' are filtered subsets of the jobs dataframe, where State is 'ea' or 'es' respectively
ea =jobs[jobs['State']=='ea'].replace('ea','')
es = jobs[jobs['State']=='es'].replace('es','')
In [30]:
# creating a dataframe by concatenating the "ea" and "es" dataframes
df_state = pd.concat([ea, es], axis=0)
In [31]:
# Creating a function that loops through each row in the "City" column,
# splits the value into words, and checks each word for a match against
# the 50 states of the United States
def findstate(s, state):
    result = []
    for value in s:
        items = value.split(' ')
        found_state = False
        for item in items:
            if item in state:
                result.append(item)
                found_state = True
        if not found_state:
            result.append('NA') # append 'NA' if no state name is detected
    return result
In [32]:
# Assigning state names to the "State" column in "df_state" using the findstate function on the "City" column
df_state['State'] =findstate(df_state['City'],state)
In [33]:
# replacing state names with their respective abbreviations using our "state_abb" dictionary
df_state['State'].replace(state_abb, inplace=True)
In [34]:
# creating a new column and copying the values from the state
df_state['job_state']= df_state['State']
In [35]:
# merging the jobs dataframe with "df_state" on "job_id" using a left join, resulting in a new dataframe
merge_jobs = pd.merge(jobs, df_state[['job_id','job_state']], on='job_id',how='left')
In [36]:
# updating "State" in "merge_jobs" to "job_state" where "job_state" is not missing, otherwise keeping the original
merge_jobs['State'] = np.where(merge_jobs['job_state'].notna(), merge_jobs['job_state'], merge_jobs['State'])
In [37]:
# Dropping the column after update
merge_jobs.drop('job_state',axis=1,inplace=True)
In [38]:
# Check the values in "State" Column
merge_jobs['State'].value_counts().head()
Out[38]:
CA    2078
TX    1451
US    1133
NY     919
FL     798
Name: State, dtype: int64

To do our analysis, we need to merge another dataset called "company_industries" into our dataframe to get the industry information for each company.

In [39]:
# reading data from "company_industries.csv" and dropping duplicates
company_industry = pd.read_csv('company_industries.csv').drop_duplicates()
In [40]:
# Merging the dataframes "merge_jobs" and "company_industry" on "company_id" using a left join
df_jobs= pd.merge(merge_jobs, company_industry, on= 'company_id', how='left')

We select the columns we need for the final dataframe used in the analysis: 'title', 'City', 'State', 'formatted_work_type', 'remote_allowed', 'applies', 'views', 'sponsored', 'Skill 1', 'Skill 2', 'Skill 3', 'name', 'employee_count', 'follower_count', 'industry', 'city' and 'state'. We then rename the columns to make them clearer and easier to understand.

In [41]:
# Creating a job_postings dataframe by selecting specific columns from df_jobs and making a separate copy
job_postings = (df_jobs[['title','City','State','formatted_work_type','remote_allowed', 'applies','views', 'sponsored','Skill 1', 'Skill 2', 'Skill 3','name','employee_count','follower_count','industry','city','state']]).copy()
In [42]:
# Defining a dictionary "new_columns" to map existing column names to new, more descriptive names of columns
new_columns = {'title': 'Job_Title','City':'Job_City','State':'Job_State','formatted_work_type':'Job_Type', 'applies':'Applies', 'remote_allowed': 'Remote', 'views': 'Views', 'sponsored': 'Sponsored',
          'name':'Company_Name', 'employee_count':'Employees', 'follower_count': 'Followers', 'industry': 'Company_Industry', 'city':'Company_City', 'state':'Company_State'}
In [43]:
# Renaming the names of the columns to new columns
job_postings.rename(columns=new_columns,inplace=True)

At last, we finalize our dataframe with some more data processing: filling in missing values and changing data types.

Fill in the missing values¶

We assume each company that posted jobs has at least one employee, so we fill the missing values in the "Employees" column with 1. We assume each company with a missing value in "Followers" has 0 followers on LinkedIn, so we fill the missing values in the "Followers" column with 0.

Change data type¶

We change the data type of the "Remote", "Applies", "Views", "Employees" and "Followers" columns to integer to make the dataframe clearer and more consistent.

In [44]:
# Counting the nulls in the dataframe
job_postings.isnull().sum()
Out[44]:
Job_Title               0
Job_City                0
Job_State               0
Job_Type                0
Remote                  0
Applies                 0
Views                   0
Sponsored               0
Skill 1               618
Skill 2              6421
Skill 3             13298
Company_Name          399
Employees             353
Followers             353
Company_Industry      380
Company_City          399
Company_State         402
dtype: int64
In [45]:
# Filling missing Employees values with 1 and replacing 0 with 1, updating the dataframe in place
job_postings.Employees.fillna(1,inplace=True)
job_postings.Employees = job_postings.Employees.replace(0,1)
In [46]:
# Filling missing values in the "Followers" column of "job_postings" with 0, applying the changes directly to the dataframe.
job_postings.Followers.fillna(0,inplace=True)
In [47]:
# Converting the datatypes of the following columns to integers.
job_postings[['Remote','Applies','Views','Employees','Followers']] =job_postings[['Remote','Applies','Views','Employees','Followers']].astype(int)

Final Dataset¶

After processing the data, the final dataset is ready for analysis. The final dataset is a combination of 5 datasets consisting of only the required and necessary columns. The dataset has 15593 rows and 17 columns. The dataset gives details on the job postings, the skills required, whether the job offers remote work, and the details about the companies.

In [48]:
# Printing the dataframe
job_postings.head()
Out[48]:
Job_Title Job_City Job_State Job_Type Remote Applies Views Sponsored Skill 1 Skill 2 Skill 3 Company_Name Employees Followers Company_Industry Company_City Company_State
0 Licensed Insurance Agent Chico CA Full-time 0 0 5 1 SALE BD NaN NaN 1 0 NaN NaN NaN
1 Sales Manager Santa Clarita CA Full-time 0 0 0 0 SALE BD NaN CargoLogin. 15 159 Transportation/Trucking/Railroad Santa Clarita California
2 Model Risk Auditor New York NY Contract 0 1 17 0 ACCT FIN NaN Employvision Inc. 48 14476 Staffing & Recruiting Somerset New Jersey
3 Business Manager Forney TX Full-time 0 0 0 0 NaN NaN NaN First Baptist Church Forney 1 0 Religious Institutions Forney Texas
4 NY Studio Assistant New York NY Full-time 0 0 2 1 DSGN ART IT Ken Fulk Inc 87 9790 Design San Francisco California

Data Quality Check¶

With the final dataset assembled, we check the data quality once more. We have identified and handled the missing values and redundancy in our data set by filtering, imputing and merging the datasets. The data cleaning has achieved the following:

  • Integrity
  • Consistency
  • Correctness of datatypes
  • Standardization
  • Reduced Redundancy

We check the missing values one last time; the remaining nulls are in the skill and company columns, where the information is simply not available in the source data.

In [49]:
# Check the missing value of the dataframe
job_postings.isnull().sum()
Out[49]:
Job_Title               0
Job_City                0
Job_State               0
Job_Type                0
Remote                  0
Applies                 0
Views                   0
Sponsored               0
Skill 1               618
Skill 2              6421
Skill 3             13298
Company_Name          399
Employees               0
Followers               0
Company_Industry      380
Company_City          399
Company_State         402
dtype: int64

Let's look at some basic characteristics of the dataset.¶

In [50]:
# Find out the Job Type distribution
job_postings['Job_Type'].value_counts().to_frame()
Out[50]:
Job_Type
Full-time 12593
Contract 1722
Part-time 988
Temporary 119
Internship 111
Other 52
Volunteer 8
In [51]:
# Find out the Sponsorship distribution
job_postings['Sponsored'].value_counts().to_frame()
Out[51]:
Sponsored
0 11086
1 4507
In [52]:
# Find out the Top 5 industries which have the most job postings
job_postings['Company_Industry'].value_counts().to_frame().head()
Out[52]:
Company_Industry
Staffing & Recruiting 2184
Information Technology & Services 1754
Hospital & Health Care 1367
Retail 954
Computer Software 855

Question 1¶

How has the prevalence of remote work in job postings changed at different locations and are there differences between the industries?

Why is the question important?¶

Addressing how the prevalence of remote work in job postings has changed across different locations and industries is crucial for job seekers in today’s evolving job market for the following reasons:

  1. It will help job seekers understand the dynamics of the remote work market and help them align their strategies with industries and locations where remote work is prevalent.

  2. It provides insights into the shifting nature of the work environment, enabling job seekers to make informed decisions about relocating or pursuing a career in a different sector.

  3. Studies have shown that remote work can have a strongly positive impact on work-life balance. Analysing these trends is vital for job seekers in a market increasingly influenced by remote work possibilities.

https://www.forbes.com/sites/glebtsipursky/2022/11/01/does-remote-work-hurt-wellbeing-and-work-life-balance/?sh=ebaf7ba4b204

What are we doing about it?¶

In our analysis, we aim to quantify the rise of remote work opportunities, revealing how this trend intersects with industry sectors and geographic regions. By calculating and visualising the percentage of remote work offerings, we provide job seekers with actionable data to navigate the job market effectively.

In [53]:
# Calculating the percentage of Job postings that allow remote work
remote_jobs_percentage = job_postings['Remote'].mean() * 100
In [54]:
# Displaying the percentage
print(f'The percentage of job postings that allow remote work is {remote_jobs_percentage:.2f}%')
The percentage of job postings that allow remote work is 14.65%
In [55]:
# Calculating the percentage of jobs that allow remote work with respect to each industry
remote_jobs_industries = job_postings.groupby('Company_Industry')['Remote'].mean().sort_values(ascending = False) * 100
In [56]:
# Displaying the data
top_10 = 10
remote_jobs_industries.head(top_10)
Out[56]:
Company_Industry
Fine Art                            100.000000
Think Tanks                         100.000000
Professional Training & Coaching     85.714286
Writing & Editing                    75.000000
E-learning                           71.428571
Market Research                      66.666667
Political Organization               66.666667
Venture Capital & Private Equity     65.000000
Investment Management                50.000000
Computer Hardware                    40.000000
Name: Remote, dtype: float64
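Note that industries at 100% (e.g. Fine Art, Think Tanks) may have only a handful of postings, so their shares are less statistically reliable. The same groupby can be combined with an assumed minimum-postings threshold to keep only better-supported industries (the threshold value and the toy data below are our own):

```python
import pandas as pd

# Toy stand-in for job_postings with the two columns this step uses
toy = pd.DataFrame({
    'Company_Industry': ['Fine Art', 'Retail', 'Retail', 'Retail', 'Retail'],
    'Remote': [1, 1, 0, 0, 0],
})

grp = toy.groupby('Company_Industry')['Remote']
share = grp.mean() * 100   # percentage of remote postings per industry
counts = grp.size()        # number of postings per industry

min_postings = 3           # assumed threshold; tune as needed
reliable = share[counts >= min_postings].sort_values(ascending=False)
print(reliable)
```

Here Fine Art (a single posting) is filtered out while Retail (four postings, 25% remote) survives.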

Visualizations¶

In this question, the prevalence of remote work in different states and across various industries is discussed. To display it visually, we use a bar chart to show the prevalence of remote work across the top industries, and a choropleth map of the United States to show the prevalence of remote work in different states.

Why take a Bar Chart?

  • A bar chart, though very simple, effectively supports a side-by-side comparison of different industries, making it easy to identify the sectors with a high share of remote jobs while remaining understandable to everyone.
  • The bar chart displays the percentage of remote work prevalent in each industry, which makes the assessment very easy.
  • Since it is understood by all, the analysis is accessible to a wide range of audiences.

Why take a Choropleth Map?

In order to visualize the prevalence of remote work in different states, a choropleth map is a natural choice.

  • The choropleth map displays the entire US map with all 50 states, highlighting the state-wise distribution of remote work.
  • The map uses different shades of the same colour to represent the different levels of prevalence.
  • This makes the map visually appealing and makes it easy for viewers to identify the states with high or low remote work opportunities.
In [57]:
# Importing plotly
import plotly.express as px
In [58]:
# Resetting the Index for top 30 records
top_30_remote_jobs_industries = remote_jobs_industries.reset_index().head(30)
In [59]:
# Plotting the graph to display the percentage of remote work availability
bar_chart = px.bar(top_30_remote_jobs_industries,x = 'Company_Industry', y='Remote', title='Remote Work Availability by Industry')
bar_chart.update_layout(width = 1000, height = 600, xaxis_title='Industry', yaxis_title='Percentage Remote Work Availability')
bar_chart.update_xaxes(tickangle=45)
bar_chart.show()
In [60]:
# Calculating the percentage of jobs that allow remote work for each state
remote_work_by_state = (job_postings.groupby('Job_State')['Remote'].mean() * 100).reset_index(name = 'Proportion')
In [61]:
# Creating a choropleth map
remote_work_by_state_map = px.choropleth(
    remote_work_by_state,
    locations='Job_State',
    locationmode='USA-states',
    color='Proportion',
    color_continuous_scale="blues",
    title='Percentage of Job Postings with Remote Work by State',
    labels={'Proportion': 'Percentage of Job Postings by State'},
    range_color=(0, 40)
)
In [62]:
# Update the layout to include a US map
remote_work_by_state_map.update_geos(scope='usa')
remote_work_by_state_map.update_layout(width=1000, height=600)

# Displaying the map
remote_work_by_state_map.show()

Wyoming, Maine and Missouri are the leading states in terms of remote work proportion and these are our 3 inferences from the observation:

1. Quality of life as a remote work driver:

  • Maine’s high quality of life and natural beauty have made it attractive for remote workers
  • Remote workers are drawn to Wyoming for its outdoor lifestyle and overall quality of life.
  • Missouri's initiatives to allow remote work for state employees indicate a focus on improving quality of life and job satisfaction.

2. Impact of population and geography as a remote work driver:

  • Maine is addressing its talent shortage and attracting out of state talent through remote work.

3. Economic Diversity as remote work driver:

  • Missouri’s approach to remote work policies and bills suggests an alignment with broader economic goals and diversification.
  • Remote workers in Wyoming bring skill sets not typically present in the state, thus increasing labour diversity.

Question 2¶

Which job type has the highest number of postings and what jobs are being applied to the most, is there a gap between the demand and supply and why is that?

Why is this question important?¶

This question identifies the job type with the highest number of postings and the job type receiving the most applications. This helps with the following:

1] Identification of in-demand jobs: - This question helps us identify the jobs that are preferred by a lot of people.

2] Helps HR and companies build strategies: - It gives HR and companies insight into which job types are in high demand but low in supply, or vice versa. Understanding these dynamics supports effective workforce planning.

3] Guidance for Job Seekers and Career Changers: - It guides individuals who are looking for jobs, or attempting to change career paths, toward the skills needed to be competitive for an in-demand job.

What are we doing about it?¶

Our analysis targets understanding labour market dynamics by examining job postings and application volumes across various industries. By aggregating and comparing the number of job postings (demand) and applications received (supply), we aim to identify which industries are experiencing high demand for labour and which are attracting the most applicants. We visualize the top 10 industries by their ratio of applications to job postings. This analysis aims to offer valuable insights for job seekers and employers alike, giving them more information to support their decision making.

In [63]:
# Group by Company_Industry and calculate the sum of Applies and count of Job_Title for each Industry
grouped_df = job_postings.groupby('Company_Industry').agg({'Applies': 'sum', 'Job_Title': 'count'}).reset_index()
In [64]:
# Rename the columns
grouped_df = grouped_df.rename(columns={'Applies': 'total_applies', 'Job_Title': 'postings_count'})
In [65]:
# Display the grouped dataframe
grouped_df.head()
Out[65]:
Company_Industry total_applies postings_count
0 Accounting 630 115
1 Airlines/Aviation 107 14
2 Alternative Medicine 1 2
3 Animation 1 1
4 Apparel & Fashion 1819 116

Job Type-Highest Number of Postings¶

In [66]:
#Sort the dataframe by postings_count in descending order
sorted_postings=grouped_df.sort_values(by='postings_count', ascending=False)
In [67]:
#Display the job type with the highest number of postings
highest_job_postings=sorted_postings.iloc[0]['Company_Industry']
In [68]:
#Display the count of the highest postings
highest_postings_count=sorted_postings.iloc[0]['postings_count']
print(f'The job type with the highest number of postings is: {highest_job_postings} with {highest_postings_count} postings ')
The job type with the highest number of postings is: Staffing & Recruiting with 2184 postings 

Since Staffing and Recruiting has the highest number of job postings, we can infer that this function is essential to many industries. As businesses grow and experience turnover, there is a constant need to fill these positions, and this constant demand is why Staffing and Recruiting has a high volume of job postings.

Job Type-Highest Number of Applies¶

In [69]:
#Sort the dataframe by total_applies in descending order
sorted_applies=grouped_df.sort_values(by='total_applies', ascending=False)
In [70]:
#Display the job type with the highest number of applicants
highest_job_applies=sorted_applies.iloc[0]['Company_Industry']
In [71]:
#Display the count of the highest applicants
highest_total_applies_count=sorted_applies.iloc[0]['total_applies']
print(f'The job type with the highest number of applicants is: {highest_job_applies} with {highest_total_applies_count} applicants ')
The job type with the highest number of applicants is: Staffing & Recruiting with 42912 applicants 

Since the highest number of applicants is also in Staffing and Recruiting, we can infer that there is a constant supply of people who want to fill these roles. This is probably because not much experience is needed to apply for these jobs: a person of any age, at any point in their life, can apply for and work in most staffing and recruiting positions.
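As a side note, the same maxima can be read directly with `idxmax` instead of sorting and indexing with `.iloc[0]`, which also avoids accidentally indexing into the wrong sorted frame. A minimal sketch on a toy frame shaped like `grouped_df` (the Retail applies figure is illustrative):

```python
import pandas as pd

# Toy frame shaped like grouped_df; Retail's total_applies is illustrative
grouped_df_demo = pd.DataFrame({
    'Company_Industry': ['Accounting', 'Staffing & Recruiting', 'Retail'],
    'total_applies': [630, 42912, 3000],
    'postings_count': [115, 2184, 954],
})

# idxmax returns the row label of the maximum, which .loc turns into the industry name
top_postings = grouped_df_demo.loc[grouped_df_demo['postings_count'].idxmax(), 'Company_Industry']
top_applies = grouped_df_demo.loc[grouped_df_demo['total_applies'].idxmax(), 'Company_Industry']
print(top_postings, top_applies)
```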

In [72]:
#Create a new column and store the ratio of total_applies to postings_count
grouped_df['ratio_applies_postings']=grouped_df['total_applies']/grouped_df['postings_count']
In [73]:
#Increment index values to start the index from 1
grouped_df.index=grouped_df.index+1
In [74]:
# Displaying the data
grouped_df.head()
Out[74]:
Company_Industry total_applies postings_count ratio_applies_postings
1 Accounting 630 115 5.478261
2 Airlines/Aviation 107 14 7.642857
3 Alternative Medicine 1 2 0.500000
4 Animation 1 1 1.000000
5 Apparel & Fashion 1819 116 15.681034
In [75]:
#Sort grouped_df to show the ratio in descending order
df_sorted_ratio=grouped_df.sort_values(by='ratio_applies_postings', ascending=False)
In [76]:
#Display the top 10 industries with highest ratios
df_sorted_ratio.head(top_10)
Out[76]:
Company_Industry total_applies postings_count ratio_applies_postings
67 Investment Management 158 2 79.000000
140 Writing & Editing 286 4 71.500000
129 Think Tanks 109 2 54.500000
52 Health, Wellness & Fitness 4734 91 52.021978
28 Cosmetics 1428 28 51.000000
21 Computer Hardware 231 5 46.200000
134 Venture Capital & Private Equity 809 20 40.450000
72 Legislative Office 31 1 31.000000
117 Research 1480 57 25.964912
16 Civic & Social Organization 308 12 25.666667

Why are we using a line graph?¶

We are using a line graph because it makes it easy to trace how the ratio falls across the ranked industries. The connected line highlights variations in the data over these specific categories.

Assumptions¶

Supply: In this case, our supply is the number of applicants, which corresponds to the total_applies column in the dataframe. From an industry's point of view, applicants are the labour being supplied to it.

Demand: In this case, our demand is the number of postings, which corresponds to the postings_count column in the dataframe. The postings an industry puts up represent its demand for more employees.

When the ratio of applies to postings (total_applies / postings_count) is greater than 1, the supply of applicants for that industry exceeds the industry's demand.
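This rule can be sketched as a quick classification (toy ratios; the threshold of 1 follows the assumption stated above, and the labels are our own):

```python
import pandas as pd

# Toy ratios of total_applies / postings_count per industry
ratios = pd.Series({'Investment Management': 79.0,
                    'Animation': 1.0,
                    'Alternative Medicine': 0.5})

# Ratio > 1 -> more applicants than postings (supply exceeds demand)
labels = ratios.apply(lambda r: 'supply-heavy' if r > 1 else 'demand-heavy')
print(labels)
```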

In [77]:
# Select the top 10 values for the line graph
top_10 = df_sorted_ratio.head(10).copy()

# Map emojis to the respective industry's ratio
emojis = ['🏦', '📝', '💡', '🏋️‍♀️', '💄', '💻', '💸', '🧑‍⚖️', '🏛️', '🔬']

# Add emojis to the 'text' column
top_10.loc[:, 'text'] = [f"{emoji} {percentage:.2f}" for emoji, percentage in zip(emojis, top_10['ratio_applies_postings'])]

# Create the line graph
fig = px.line(top_10,
              x='Company_Industry',
              y='ratio_applies_postings',
              text='text',
              labels={'ratio_applies_postings': 'Ratio (Applies to Posting)'},
              title='Top 10 Industries by Highest Ratio',
              height=700)

# Edit layout
fig.update_layout(
    xaxis_title='Company Industry',
    yaxis_title='Ratio (Applicants/Job Postings)',
    plot_bgcolor='black',  
    paper_bgcolor='black', 
    font_color='white',  
    margin=dict(l=10, r=1, t=50, b=50),  

)
fig.update_traces(
    textfont=dict(size=17),  
    line_shape='linear',   
    line=dict(width=2, color='aqua'),   
    textposition='top right'  
)

# Show the interactive plot
fig.show()

The analysis uncovers a lot of interesting insights about demand and supply.

There are two major inferences that we would like to discuss:

  1. High competition in specific industries: The graph suggests that industries like investment management and writing and editing are experiencing a high ratio of applications to job postings, which could indicate a highly competitive job market in these fields. This may be due to a surplus of qualified candidates relative to the number of available positions, or it could reflect the strong desirability of jobs within these industries, leading to many applications for relatively few openings. Many people also take up writing and editing as a side hustle, and both investment management and writing and editing can be completely remote, which motivates more people to apply.

  2. Potential skill gap: On the other hand, industries such as civic and social organization and research have a lower ratio of applications to job postings. This could indicate a skills gap, where the number of available jobs exceeds the number of applicants who possess the required qualifications, or it might suggest less interest in these sectors among job seekers. For HR and strategic planning, these industries might need to invest more in training programs to build the required workforce, or in recruitment marketing to attract more applicants.

Question 3¶

How does traction (views) compare to engagement (applies), and what does that tell us about companies of different sizes?¶

Why do we want to explore this perspective of the job postings? As graduate students, we will soon be job seekers. The number of times a job posting is viewed by potential candidates can be considered its traction, while the number of applications submitted reflects engagement. Higher traction and engagement suggest that a job is attractive to a larger pool of candidates, possibly because of factors like company reputation, salary, or career growth opportunities. Since we do not have data on salary or career growth opportunities here, we approach attractiveness through company reputation. Larger companies tend to be better known and more influential in their industries; some are the dominant players in their markets, which gives them higher reputations than medium or small companies. We therefore expect job postings from large companies to attract more views and applies due to their brand recognition and resources for extensive job advertising. Medium and small companies might see less traction and engagement, especially if they are newly established or operate in a niche market.

To look into this problem, we first checked the range of the total number of employees at the companies that posted jobs, and discovered that it runs from 1 to 829,111. We therefore divided companies into 8 company sizes.

  • For companies with at most 200 employees, the company size is 1;
  • For companies with more than 200 but at most 1,000 employees, the company size is 2;
  • For companies with more than 1,000 but at most 5,000 employees, the company size is 3;
  • For companies with more than 5,000 but at most 10,000 employees, the company size is 4;
  • For companies with more than 10,000 but at most 50,000 employees, the company size is 5;
  • For companies with more than 50,000 but at most 100,000 employees, the company size is 6;
  • For companies with more than 100,000 but at most 500,000 employees, the company size is 7;
  • For companies with more than 500,000 but at most 1,000,000 employees, the company size is 8.

We then grouped these size codes into four categories for further analysis: small company, medium company, large company, and giant company. Since giant companies tend to be influential, sometimes dominant, players in their industries, a career with them seems more promising, so we expected to find the most views and applies on their job postings.

| Company Category | Company Size | Employee Amount | Description | Reputation Scale |
| --- | --- | --- | --- | --- |
| Small Company | 1, 2 | 0–1,000 | Newly established or operating in a niche market | Little to no |
| Medium Company | 3, 4 | 1,000–10,000 | N/A | Medium to little |
| Large Company | 5, 6, 7 | 10,000–500,000 | N/A | Tend to be high |
| Giant Company | 8 | 500,000–1,000,000 | Influential companies in their industries | Highest with global presence |
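The mapping from the eight size codes to the four categories can be sketched as follows (a hypothetical helper; the notebook itself keeps the numeric codes):

```python
import pandas as pd

# Assumed mapping from the table above: size codes 1-8 -> four company categories
size_to_category = {1: 'Small', 2: 'Small',
                    3: 'Medium', 4: 'Medium',
                    5: 'Large', 6: 'Large', 7: 'Large',
                    8: 'Giant'}

# Toy size codes; in the analysis these come from the Company_Size column
sizes = pd.Series([1, 3, 5, 8, 2], name='Company_Size')
categories = sizes.map(size_to_category)
print(categories.tolist())
```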
In [78]:
# Select the columns needed for this analysis and created a new dataframe
view_apply = job_postings[['Job_Title','Job_Type','Job_State','Remote','Sponsored','Company_Name','Employees','Company_Industry','Views','Applies']].copy()
view_apply.head()
Out[78]:
Job_Title Job_Type Job_State Remote Sponsored Company_Name Employees Company_Industry Views Applies
0 Licensed Insurance Agent Full-time CA 0 1 NaN 1 NaN 5 0
1 Sales Manager Full-time CA 0 0 CargoLogin. 15 Transportation/Trucking/Railroad 0 0
2 Model Risk Auditor Contract NY 0 0 Employvision Inc. 48 Staffing & Recruiting 17 1
3 Business Manager Full-time TX 0 0 First Baptist Church Forney 1 Religious Institutions 0 0
4 NY Studio Assistant Full-time NY 0 1 Ken Fulk Inc 87 Design 2 0
In [79]:
# Drop duplicates in the "Employee" Column to find the range of the values
view_apply['Employees'].drop_duplicates().describe()
Out[79]:
count      2501.000000
mean      11034.045182
std       36625.131630
min           1.000000
25%         716.000000
50%        2417.000000
75%        8069.000000
max      829111.000000
Name: Employees, dtype: float64
In [80]:
# Create bins and bin labels for the "Employee" Column to categorize different sizes of the companies
bins = [0, 200, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
bin_labels = [1, 2, 3, 4, 5, 6, 7, 8]
In [81]:
# Create a new column as "Company_Size" for further analysis
view_apply['Company_Size'] = pd.cut(view_apply['Employees'], bins, labels=bin_labels)
view_apply.head()
Out[81]:
Job_Title Job_Type Job_State Remote Sponsored Company_Name Employees Company_Industry Views Applies Company_Size
0 Licensed Insurance Agent Full-time CA 0 1 NaN 1 NaN 5 0 1
1 Sales Manager Full-time CA 0 0 CargoLogin. 15 Transportation/Trucking/Railroad 0 0 1
2 Model Risk Auditor Contract NY 0 0 Employvision Inc. 48 Staffing & Recruiting 17 1 1
3 Business Manager Full-time TX 0 0 First Baptist Church Forney 1 Religious Institutions 0 0 1
4 NY Studio Assistant Full-time NY 0 1 Ken Fulk Inc 87 Design 2 0 1
In [82]:
# Check the company numbers of each company size
view_apply.Company_Size.value_counts()
Out[82]:
1    4288
3    3192
5    2823
2    2753
4    1360
6     692
7     381
8     104
Name: Company_Size, dtype: int64
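As a sanity check that these bins cover every employee count in the 1 to 829,111 range, a minimal sketch using a few counts taken from the rows shown earlier:

```python
import pandas as pd

bins = [0, 200, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
bin_labels = [1, 2, 3, 4, 5, 6, 7, 8]

# A few employee counts taken from the sample rows shown earlier
employees = pd.Series([1, 15, 48, 87, 2417, 829111])
sizes = pd.cut(employees, bins, labels=bin_labels)

# Every count in range should land in exactly one bin (no NaN)
assert sizes.notna().all()
print(sizes.tolist())
```

Because `pd.cut` uses right-closed intervals by default, a company with exactly 200 employees lands in size 1, matching the "at most 200" rule above.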
In [83]:
# Create the visualization for the views and applies in job postings of different company size
size = view_apply.Company_Size.astype(int) ** 3.5

# creating a scatter plot with views and applies on X and Y axies respectively
view_apply.plot.scatter(x='Views',y='Applies',figsize= (12,6),s=size,c='Company_Size',colormap='viridis')

# setting the limits for better visualisation
plt.axis(ymin=0, ymax=1200, xmin=1, xmax=3000)

# performing a linear regression
m, b = np.polyfit(view_apply.Views, view_apply.Applies, 1)

# plotting x and y (y = mx + b )
plt.plot(view_apply.Views, m*view_apply.Views + b)
Out[83]:
[<matplotlib.lines.Line2D at 0x142431110>]

Inferences¶

From the graph above we can discover several findings:

  1. Most job postings have an engagement ratio (applies to views) of around 20%;
  2. Most job postings have fewer than 1,400 views and fewer than 500 applies;
  3. However, the postings with more than 1,400 views and more than 500 applies come mostly from small companies.

The third observation contradicted our expectation; the result was the opposite of what we predicted. Based on this analysis, we discussed and inferred that certain drawbacks of large/giant companies, together with some benefits of working in small companies, contribute to this situation.
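The engagement ratio mentioned in the first finding can also be computed directly per posting rather than read off the regression slope. A minimal sketch with toy numbers (rows with zero views are excluded to avoid division by zero):

```python
import pandas as pd

# Toy views/applies; the real columns come from view_apply
df = pd.DataFrame({'Views': [5, 0, 17, 100], 'Applies': [0, 0, 1, 20]})

viewed = df[df['Views'] > 0]                       # drop zero-view rows
engagement = (viewed['Applies'] / viewed['Views']).mean() * 100
print(f'Mean engagement ratio: {engagement:.1f}%')
```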

- Higher Requirements for candidates of Large/Giant Companies:¶

As is well known, job postings from larger companies often have more extensive requirements. They might seek candidates with specific educational backgrounds, years of experience, and a diverse set of skills. The high bar set by these extensive requirements may intimidate job seekers and discourage them from applying without even opening the job details. Moreover, candidates might assume that jobs at larger companies are highly competitive, leading them to believe their chances of success are slim and to self-select out due to time and effort concerns.

- More Rules to Follow in Large/Giant Companies which means less Flexibility:¶

Larger organizations often have more structured and formalized processes, including strict policies and procedures. While this can provide stability and consistency, it might also mean less flexibility in certain aspects of work, decision-making, or adapting to individual needs.

- Niche Focus of Small Companies:¶

Small companies often have the ability to focus on niche markets or specialized areas within an industry. This niche focus allows them to tailor their offerings more precisely and attract candidates with specific skills or interests. So it will attract people who are interested in the market, especially if it is emerging.

- Unique Opportunities in Small Companies:¶

Small companies may offer more unique opportunities because they have fewer hierarchical levels. This structure gives employees more chances to be involved in decision-making, and thus more chances to show their skills and talent.

- More Opportunity for a Variety of Work in Small Companies:¶

In small companies, employees may have more chances to engage in various tasks and projects, contributing to a broader skill set and a more dynamic work experience. This variety can be attractive to individuals who value diversity in their work.

Question 4¶

Overall, which skills are most in demand? And, going further, which skills are most in demand within each industry?

Why is this question important?

The industry-based in-demand skills chart provides a wealth of insights into the labor market's current state and future directions. It highlights key areas for professional development, industry trends, and potential gaps between supply and demand for various skills.

We aim to uncover:

Skill Ecosystem: Skill bubbles by industry reveal growth areas, indicating where job seekers should focus their upskilling efforts.

Career Mapping: Guides professionals on in-demand skills, aligning their career development with market trends.

Strategy and development: Directs educational and work force strategies towards addressing the demand for niche skills.

What are we doing about it?

We are aggregating job posting data to identify the frequency of skills demanded across all industries present in the dataset, after which we find the in-demand skills of the top 20 booming industries. This involves relevant visualisations of the data that further allow us to observe which skills are widely sought after and which are more niche.

Visualization¶

In this question, bubble charts are used to visualize which skills are the most in demand, both overall and industry-wise.

Why take a Bubble Chart?

  • A bubble chart clearly conveys which skills are popular: the size of each bubble reflects the frequency of the skill, so more frequent skills get larger bubbles.
  • It conveys the message at a glance, as larger bubbles naturally draw more attention, immediately showing which skills are the most in demand.
  • It makes comparing skill frequencies easy at a single glance, helping with a quick assessment of skill demand in the job market. The colours and varying bubble sizes also make the chart visually appealing.
  • Using Plotly makes the chart interactive: viewers or presenters can hover over each bubble to get more detailed information. The chart also uses less space and looks neater than a traditional line or bar graph.
In [84]:
# Extracting relevant columns
skills_columns = ['Skill 1', 'Skill 2', 'Skill 3']
company_columns = ['Job_Title', 'Company_Name']

skills_series = pd.concat([job_postings[col] for col in skills_columns]) # Combining skills from "Skill 1," "Skill 2," and "Skill 3" columns
skills_count = skills_series.value_counts() # Counting the occurrences of each skill
skills_count = skills_count.dropna() # Filtering out NaN values

# Creating a DataFrame for the Bubble Chart
bubble_data = pd.DataFrame({
    'Skill': skills_count.index, # Assigning skill name to skill column
    'Frequency': skills_count.values, # Assigning skill frequencies to frequency column
})

# Creating a Bubble Chart using plotly
fig = px.scatter(bubble_data, x='Skill', y='Frequency', size='Frequency', color='Frequency', # Defining parameters
                 labels={'Frequency': 'Skill Frequency', 'Skill': 'Skill'}, # Customizing axis labels
                 title='In-Demand Skills', # Providing title
                 size_max=80,  # Setting bubble size
                 hover_name=None,
                 template='plotly_dark',  # For Dark Theme
                 color_continuous_scale=px.colors.sequential.Viridis,  # Setting Color Scale
                 )

fig.update_traces(marker=dict(line=dict(width=2, color='white')))  # Setting bubble border width and color

# Customizing the layout
fig.update_layout(
    xaxis=dict(title='Skill', showgrid=False),  # Removing x-axis grid
    yaxis=dict(title='Skill Frequency', showgrid=False),  # Removing y-axis grid
    legend=dict(title='Frequency'),
    title_x=0.5,  # Aligning title
     font=dict(size=13.5, color='white'),  # Setting font style and size
)

fig.show()

Based on the visualization above, the top required skills across all industries are:

  • IT
  • Sales
  • Management
  • Manufacturing
  • Engineering

Top-skills Analysis:¶

Information Technology:

The continuous digital transformation of business operations is the reason that IT skills are so ubiquitous across industries. As businesses depend more and more on technological solutions, IT specialists are essential to establishing and maintaining the operation of these systems. IT skills are essential for increasing productivity and maintaining competitiveness in the modern business environment, from data analysis to process automation.

Furthermore, professionals may adapt to the evolving landscape of emerging technologies due to the versatility of IT skills. Artificial intelligence, cloud computing, and expertise in cybersecurity are now especially important since they help companies use cutting-edge technology to their advantage and remain resilient against cyberattacks.

Sales:

In every industry, having strong sales skills is essential for generating revenue and maintaining company expansion. Salespeople are in the forefront of generating revenue and make a substantial financial contribution to a business. Sales abilities are crucial for companies looking to succeed in competitive marketplaces because they facilitate the development and maintenance of customer relationships as well as efforts to expand their market.

Moreover, the dynamic landscape of sales requires adapting to digital channels and utilizing data-driven perspectives to develop more focused and efficient marketing approaches. In today's sales environment, professionals who possess a combination of old-fashioned relationship-building abilities and new, analytics-driven methods are especially valuable.

Management:

Every organization needs effective management to run smoothly and be successful. To accomplish organizational objectives, competent managers demonstrate leadership, make strategic choices, and maximize resources. Managing a company's overall health and direction requires a wide range of competencies, from team leadership to strategic planning.

Furthermore, competent managers possess the ability to cultivate a favourable work environment, encourage staff involvement, and overcome obstacles with courage. Effective management also helps organizations succeed by encouraging innovation within teams, implementing efficient workflows, and being able to adapt to change.

Manufacturing:

Production, operations, and supply chain management skills are essential in manufacturing industries. It is essential to be able to guarantee product quality, streamline production procedures, and innovate in response to market demands. Manufacturing experts make a big difference in a company's capacity to maintain its competitiveness and efficiently produce high-quality products.

Engineering:

Innovative thinking, creative problem-solving, and design are all characteristics of engineering skills. When it comes to fostering innovation and advancing the development of new products, engineers are essential. In a variety of industries, engineering skills are crucial for everything from creating unusual solutions to overcoming difficult problems. Furthermore, engineers frequently take on project management duties to guarantee the smooth running of projects from the beginning to completion.

Modern engineering is interdisciplinary, so working with various teams—such as data scientists, business analysts, and marketers—is also necessary. Innovative projects and solutions are successfully implemented when engineers are able to collaborate across disciplines and effectively communicate complex technical concepts.

Showcasing the in-demand skills industry based¶

In [85]:
# Grouping DataFrame by 'job_id' and 'industry'
industry_job_counts = df_jobs.groupby(['job_id', 'industry']).size().reset_index(name='Job_Count')

# Summing job counts per industry, sorting in descending order, and keeping the 20 industries with the most postings
booming_industries = industry_job_counts.groupby('industry')['Job_Count'].sum().sort_values(ascending=False).head(20)

# Extracting the skills and industry columns
skills_columns = ['Skill 1', 'Skill 2', 'Skill 3']
industry_column = 'Company_Industry'

top_skills_data = pd.DataFrame(columns=['Skill', 'Frequency', 'Industry']) # Creating an empty DataFrame to store the top skills for each industry

skills_data = pd.DataFrame(columns=['Skill', 'Frequency', 'Industry']) # Creating an empty DataFrame to store skills data for each industry

# Iterating over each booming industry, We use loop to examine each booming industry within the dataset. 
# For every industry we construct a DataFrame to catalog the skills demanded in job postings. 
for industry in booming_industries.index:
    # Creating a DataFrame for the specific industry
    industry_data = pd.DataFrame({
        # The stack() method is used to transform a two-dimensional data structure into a one-dimensional series
        # value_counts() generates a count of unique skills and their respective frequencies.
        'Skill': job_postings.loc[job_postings[industry_column] == industry, skills_columns].stack().value_counts().index,
        'Frequency': job_postings.loc[job_postings[industry_column] == industry, skills_columns].stack().value_counts().values,
        'Industry': industry
    })

    # Checking if the industry_data is not empty
    if not industry_data.empty:
        # Concatenating the industry data to the DataFrame
        skills_data = pd.concat([skills_data, industry_data], ignore_index=True)

skills_data['Frequency'] = pd.to_numeric(skills_data['Frequency']) # Converting 'Frequency' column to numeric type

top_skills_data = skills_data.loc[skills_data.groupby('Industry')['Frequency'].idxmax()] # Finding the top skill for each industry

# Creating a Bubble Chart using plotly
fig = px.scatter(top_skills_data, x='Skill', y='Frequency', size='Frequency', color='Industry', # Defining parameters
                 labels={'Frequency': 'Skill Frequency', 'Skill': 'Top Skill'}, # Customizing axis labels
                 title='Top Skills for Top 20 Booming Industries', # Setting title
                 size_max=90,  # Setting bubble size
                 template='plotly_dark',  # Dark theme
                 color_continuous_scale=px.colors.sequential.Plasma,  # Color scale
                 )

fig.update_traces(marker=dict(line=dict(width=2, color='white')))  # Setting bubble border width and color

fig.update_layout(
    xaxis=dict(title='Skill', showgrid=False),  # Removing x-axis grid
    yaxis=dict(title='Skill Frequency', showgrid=False),  # Removing y-axis grid
    legend=dict(title='Industry'),
    title_x=0.5,
    font=dict(family='Arial', size=14, color='black'),  # Setting font style, size, and color
    paper_bgcolor='rgba(0,0,0,0)',  # For transparent background
    plot_bgcolor='black',  # For transparent plot area
)
# Showing the plot
fig.show()
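To make the stack()/value_counts()/idxmax() pattern above concrete, here is a minimal, self-contained sketch on a synthetic table. The column names mirror the notebook's, but the rows are invented purely for illustration:

```python
import pandas as pd

# Toy postings: three skill columns plus an industry label (synthetic data)
toy = pd.DataFrame({
    'Company_Industry': ['Retail', 'Retail', 'IT'],
    'Skill 1': ['Sales', 'Sales', 'Python'],
    'Skill 2': ['CRM', 'Sales', 'SQL'],
    'Skill 3': ['Excel', 'CRM', 'Python'],
})

skills_columns = ['Skill 1', 'Skill 2', 'Skill 3']
frames = []
for industry, grp in toy.groupby('Company_Industry'):
    # stack() flattens the skill columns into one Series; value_counts() tallies them
    counts = grp[skills_columns].stack().value_counts()
    frames.append(pd.DataFrame({'Skill': counts.index,
                                'Frequency': counts.values,
                                'Industry': industry}))

skills_data = pd.concat(frames, ignore_index=True)
# idxmax() on the grouped 'Frequency' returns the row label of each industry's top skill
top = skills_data.loc[skills_data.groupby('Industry')['Frequency'].idxmax()]
print(top)
```

On this toy frame, 'Sales' (appearing three times) wins for Retail and 'Python' (twice) for IT, which is exactly the per-industry top-skill row the notebook feeds into the bubble chart.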

Inferences for the top 4 booming industries¶

In [86]:
industry_job_counts = df_jobs.groupby(['job_id', 'industry']).size().reset_index(name='Job_Count')  # Grouping DataFrame by 'job_id' and 'industry'

# Summing job counts per industry and sorting values in descending order
booming_industries = industry_job_counts.groupby('industry')['Job_Count'].sum().sort_values(ascending=False)

booming_industries.head(4)
Out[86]:
industry
Staffing & Recruiting                2184
Information Technology & Services    1754
Hospital & Health Care               1367
Retail                                954
Name: Job_Count, dtype: int64
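As an aside, if each row of df_jobs represents a single posting (i.e. job_id values are unique), the two-step groupby above collapses to a plain value_counts() on the industry column. A toy check with invented data (toy_jobs is a stand-in for the real df_jobs):

```python
import pandas as pd

# Synthetic jobs table: one row per posting (illustrative data only)
toy_jobs = pd.DataFrame({
    'job_id': [1, 2, 3, 4, 5],
    'industry': ['Retail', 'IT', 'Retail', 'IT', 'Retail'],
})

# The notebook's two-step version
counts_a = (toy_jobs.groupby(['job_id', 'industry']).size()
                    .reset_index(name='Job_Count')
                    .groupby('industry')['Job_Count'].sum()
                    .sort_values(ascending=False))

# Direct equivalent when job_id values are unique
counts_b = toy_jobs['industry'].value_counts()

print(counts_a.to_dict(), counts_b.to_dict())
```

Both give the same per-industry tallies, so the longer form is mainly useful if duplicate job_id rows need to be counted as grouped units.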

The analysis above uncovers several interesting insights about the skills required across industries. From it, we can infer which skills are most in demand within each booming industry. Starting with:

Staffing & Recruiting:

The skill of 'Information Technology' is the most required here among all skills, as technology plays a major role in many of the Staffing & Recruiting industry's operations, such as applicant tracking, client management, and candidate sourcing. Proficiency in Information Technology is imperative for effectively navigating and utilizing the current technology tools and platforms within the industry. Additionally, the need for professionals with IT skills is rising as a result of the increased emphasis on digital transformation across industries; staffing and recruiting companies must keep up with new technologies in order to remain competitive, which makes IT expertise essential. Moreover, IT skills let staffing professionals assess candidates' suitability for particular roles and gain a better understanding of client needs and industry trends. They can also communicate with IT professionals more effectively, learn about their career goals, and evaluate them, while those IT professionals can offer insight into the constantly changing technology landscape. This knowledge helps recruiters stay up to date on talent availability, compensation expectations, and industry trends.

Information Technology & Services:

Here too, 'Information Technology' is the most required skill, as digital solutions, software, and technology are at the core of the Information Technology & Services sector. Professionals in this area therefore need a firm foundation in the field in order to understand and contribute to its fundamental functions. Demand is further driven by the breadth of services the sector provides, such as cybersecurity, cloud computing, software development, and system integration, all of which the broad and versatile field of information technology spans. A thorough understanding of IT is also essential for efficient client engagement and communication. Innovation and problem-solving are key components of the industry's success, and IT specialists are essential in advancing technology and tackling challenging issues, for example by applying cutting-edge technologies like artificial intelligence and machine learning to enormous databases in search of anomalies and inconsistencies that could point to fraudulent activity.

Hospital & Health Care:

The 'Healthcare and Public Relations' skill set is highly sought after in this area since it is essential to building positive relationships, promoting effective communication, and establishing trust within the healthcare ecosystem. Professionals with these skills can control the flow of information, respond to concerns, and reassure the public, all of which strengthen a healthcare organization's overall resilience. They are also central to developing and executing outreach programs, health education efforts, and community involvement activities that promote a positive perception of the healthcare facility. Finally, the role involves collaborating with outside parties, including advocacy groups, government agencies, and the media, to advance healthcare legislation, support public health programs, and raise awareness.

Retail:

Proficiency in sales is essential for generating revenue. Salespeople are vital to the transaction-driven retail industry because they turn leads into customers, which directly affects these companies' bottom lines. Their ability to attract and retain customers is what keeps retailers profitable and competitive in the market. Sales professionals also have a thorough knowledge of the goods and services they represent; this not only helps customers make well-informed decisions but also promotes customer satisfaction and trust, both essential for building enduring customer relationships. Additionally, salespeople play a critical role in understanding clients' needs, developing good, customized relationships with them, and making recommendations tailored to those needs, which greatly increases client loyalty and satisfaction. With the growth of e-commerce driving the digital transformation of retail organizations, sales personnel must also navigate online platforms; a well-rounded sales professional can switch between digital and physical sales channels with ease.

Conclusion¶

Our comprehensive analysis of LinkedIn job postings data has provided valuable insights into remote work trends, in-demand skills, and the alignment between job visibility and application engagement. In the course of this analysis we identified the job types with the highest demand and those receiving the most applications, revealing critical gaps between labor supply and market needs. The project not only delved into the dynamic nature of the job market but also offered actionable data that can empower job seekers, guide companies in talent acquisition, and inform curriculum development at educational institutions to bridge skill gaps.

Overall, this project serves as a strategic tool for all stakeholders in the ecosystem in navigating the complexities of the current landscape of the job market.

Pledge¶

"We pledge on my honor that we have not given nor received any unauthorized assistance on this project."

-- Group9-Jobless Jesters